October 25, 2019

Why I am here

Do as I say, not as I do

  • MEEE 1996
  • MSc Economics 1998
  • Think tank economist 1998–2000
  • PhD in Statistics 2005 (missing data in spatio-temporal models)
  • Tenure track appointment at Mizzou 2005–2008
  • Stata developer 2010–2012
  • Survey statistician 2012–present

How is industry experience different from academia?

  • what is valued
  • teams
  • communication
  • reporting and accountability

My role

  • survey statistics
  • all of statistics
  • some of data science
  • workflows and reproducibility

My take on software

  • One rectangular data set, analysis leading up to a handful of regression tables: Stata
  • Develop and deploy algorithms on web: Python, Java
  • Produce reports: R

Building blocks of my workflow

  • RMarkdown
  • R tidyverse
  • Version control
  • Cheat sheets

Nicely packaged within RStudio

  • Google and Twitter

Markdown

When you make your text bold, or italic, or create

  • items in lists
  • and more items

you are marking certain elements of your text to be formatted in a special way. (The heading above is also marked text.)

Markdown reduces this to a bare-bones, text-only process that needs no mouse selection.

https://daringfireball.net/projects/markdown/syntax

Data analysis within a slide

Markdown elements

`code`

_italics_

**bold**

### Heading 3

- unnumbered item

1. numbered item

R Markdown

Additionally, R and some other languages can

  • incorporate source code
  • incorporate output, such as numbers, tables, and plots

… into Markdown documents

Pew report

Parametric markdown documents

The most advanced forms of markdown documents use inputs/parameters.

  • Output format: HTML (the most flexible, and the only interactive), Word, PDF (requires LaTeX)
  • Task to perform: analysis vs. reporting the results
  • Input data files
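As a minimal sketch of the idea above, the snippet below writes a tiny parameterized R Markdown file and renders it with a non-default parameter value. The parameter names `party` and `n_draws` and the file contents are hypothetical, chosen only for illustration. Rendering needs the rmarkdown package and pandoc, so the call is guarded.

```r
library(rmarkdown)

# Hypothetical report source, written to a temporary file so the sketch is
# self-contained. In practice this would be a saved .Rmd file.
rmd_lines <- c(
  "---",
  "title: \"Parameterized report\"",
  "params:",
  "  party: 1",
  "  n_draws: 10",
  "---",
  "",
  "This report covers party code `r params$party`",
  "and uses `r params$n_draws` draws."
)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Render to HTML, overriding one default parameter.
# "word_document" or "pdf_document" (the latter requires LaTeX) are the
# other output formats mentioned on the slide.
if (pandoc_available()) {
  out <- render(
    rmd_path,
    output_format = "html_document",
    params = list(party = 2),
    quiet = TRUE
  )
}
```

The same source file can then serve several reports: only the `params` list and `output_format` change between runs.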

Copy-paste vs. markdown

Tidyverse

https://www.tidyverse.org/

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Pipes to do steps in sequence
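A small sketch of what the pipe does, assuming the magrittr package (which the tidyverse loads for `%>%`): each step receives the result of the previous one as its first argument, so the piped and the nested forms should give the same answer.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls read inside-out
nested <- sum(sqrt(x))

# Piped calls read top to bottom, one step per line
piped <- x %>%
  sqrt() %>%
  sum()

identical(nested, piped)
```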

Verbs for data management tasks

What does this code do?

Pew_data %>% 
  filter(party5==1) %>%
  group_by(q1a) %>%
  summarize(n=n())

Verbs for data management tasks

What does this code do?

Pew_data %>%            # data set
  filter(party5==1) %>% # select cases
  group_by(q1a) %>%     # set up a group variable 
  summarize(n=n())      # count
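To try the same chain without the Pew file, here is a sketch on a made-up data frame. The variable names `party5` and `q1a` come from the slide; the values in `toy_data` are invented for illustration and are not Pew data.

```r
library(dplyr)

# Invented stand-in for Pew_data: five respondents, two variables
toy_data <- tibble(
  party5 = c(1, 1, 1, 2, 2),
  q1a    = c(1, 2, 1, 1, 2)
)

counts <- toy_data %>%      # data set
  filter(party5 == 1) %>%   # keep rows where party5 equals 1
  group_by(q1a) %>%         # one group per value of q1a
  summarize(n = n())        # number of rows in each group

counts
```

The result has one row per value of `q1a` among the selected cases, with the count in column `n`.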

Cheatsheets

Graphics

library(ggplot2)

and a myriad of extensions to it.

Tables

Group                  N (unweighted)   MOE (percentage points)
Total                           1,203   +/- 3.3
Party ID
  Democrats                       342   +/- 6.2
  Republicans                     273   +/- 6.9
  Independents                    544   +/- 4.9
  Dem-leaning Indeps              199   +/- 8.3
  Rep-leaning Indeps              181   +/- 8.4

A lot of output looks like a table already

# aimed at HTML
library(kableExtra)
# aimed specifically at MS products
library(flextable)
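As a rough sketch of how such a table might be produced, the snippet below puts the sample sizes and margins of error from the slide into a data frame and renders it as HTML. The object and column names (`moe_table`, `group`, `n`, `moe`) are placeholders, and the styling options are just one reasonable choice.

```r
library(knitr)
library(kableExtra)

# Values copied from the table on the slide
moe_table <- data.frame(
  group = c("Total", "Democrats", "Republicans", "Independents",
            "Dem-leaning Indeps", "Rep-leaning Indeps"),
  n     = c(1203, 342, 273, 544, 199, 181),
  moe   = c(3.3, 6.2, 6.9, 4.9, 8.3, 8.4)
)

# Build an HTML table, then apply kableExtra styling
plain_tbl <- kable(
  moe_table,
  format      = "html",
  col.names   = c("Group", "N (unweighted)", "MOE (+/- pct. points)"),
  format.args = list(big.mark = ",")
)
html_tbl <- kable_styling(plain_tbl, full_width = FALSE)

# For Word or PowerPoint output, flextable::flextable(moe_table)
# would be the analogous starting point.
```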

Version control

Code version control in R

Do something

x <- 1
y <- 2

Takeaways (hopefully)

  • Tools are important in both academia and industry
  • A reproducible workflow is important: it may take some extra time to set up, but it pays back handsomely when you need to redo the analysis
  • No analysis without markdown
  • No code without version control
  • Presentations in Markdown

Thanks